Power Law Distributions in Class Relationships 
Abstract

Power law distributions have been found in many natural
and social phenomena, and more recently in the source code
and run-time characteristics of Object-Oriented (OO)
systems. A power law implies that small values are extremely
common, whereas large values are extremely rare. In this
paper, we identify twelve new power laws relating to the
static graph structures of Java programs. The graph
structures analyzed represent different forms of OO coupling,
namely inheritance, aggregation, interface, parameter type
and return type. Identification of these new laws provides
the basis for predicting likely features of classes in future
developments. The research in this paper ties together work
in object-based coupling and World Wide Web structures.



1 Introduction 



Power law distributions have been found in many natural
and social phenomena. A power law implies that small
values are extremely common, whereas large values are
extremely rare. For example, incomes, earthquake strengths,
city sizes and word frequencies all follow power law
distributions: there are many small tremors, but only a few
large earthquakes. The power law distribution is strongly
connected with Zipf's law and the Pareto distribution, often
known as the 80:20 rule [1].

We would expect a power law to apply to the size of classes
in object-oriented systems. Size in this sense is defined
in terms of the number of methods, constructors and other
class features. This hypothesis is partially supported by
previous research into key classes [5].

The existence of a power law distribution in a network
implies scale-free behaviour. This means that the network
lacks a "characteristic length scale" so that, like
fractals, when suitably magnified, small parts of it
resemble the whole. Whichever range of values is examined,
the proportion of small to large values remains the same [1].

Recently, there has been a great deal of interest in the power
laws visible in the structure of the World Wide Web [2].
These include the number of pages on web sites, the number
of links to a given website, the PageRank values of nodes
and the frequency with which users visit pages [17] [?].

The indegrees and outdegrees of individual web pages are
also subject to power law distributions [8]. In a Web
context, indegree refers to the number of pages linking to a
given page and outdegree refers to the number of pages
referenced from a given page. During the development of our
Autodoc system for assisted navigation of program
documentation [21], we found that the pages in the Javadocs
were subject to the same laws. We thus hypothesized that the
relationships were due to power law distributions in the
underlying code structure.

Our motivation for the research described in this paper is
to discover patterns and relationships which can explain the
structure of source code at a low level of abstraction.
Identifying such patterns allows us to predict, by
extrapolation, the consequences of developing larger and
more complex software. For example, we could predict how
many classes might contain more than a hundred methods in a
set of classes ten times larger than the Java Developers Kit
(JDK). Alternatively, we could predict the maximum number of
constructors of any class in that system. This may have
implications for software maintenance and comprehensibility
in terms of time spent and effort expended [14].

The identification of power law distributions in source code
enables us to categorize the coupling graphs as belonging
to a class of scale-free topologies. Similar topologies have
been found in the Web [8], the Internet [10] and in some
food webs [11]. Techniques to store, manipulate and
describe such topologies may be re-applied to handle
coupling graphs.

A further motivation is to enable models of code de- 
velopment that will allow developers to create synthetic 
code bases containing large numbers of computer-generated 
classes. For example, given an appropriate means of gener- 
ating synthetic data, a developer could generate a data set of 
a much larger number of classes. This would enable them 
to test the consequences of developing a large system before 
development begins. 
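
As a minimal sketch of what such a generator might look like,
the following Python fragment draws synthetic per-class counts
(for instance, numbers of methods) from a truncated discrete
power law. The function name sample_power_law, the cut-off
x_max and the exponent used in the example are illustrative
assumptions and do not come from our tooling; in particular,
the exponent here is that of the per-value frequency, which
need not coincide with an exponent fitted to bucketed data.

```python
import random


def sample_power_law(n, exponent, x_max, seed=0):
    """Draw n integers in 1..x_max with P(x) proportional to x**(-exponent).

    Hypothetical helper: x_max is an assumed upper cut-off that keeps
    the distribution finite and easy to normalise.
    """
    rng = random.Random(seed)
    values = list(range(1, x_max + 1))
    weights = [x ** -exponent for x in values]
    return rng.choices(values, weights=weights, k=n)


# Illustrative use: method counts for 10 000 synthetic classes.
# The exponent 2.0 is an arbitrary example value, not a fitted result.
method_counts = sample_power_law(10000, 2.0, x_max=500)
small = sum(1 for m in method_counts if m == 1)
large = sum(1 for m in method_counts if m >= 100)
print(small, large)
```

A generator of this kind only reproduces one marginal
distribution at a time; producing a full synthetic class graph
would additionally require a model of how the links are formed.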

Finally, our work has implications for the graph traversal
algorithms used in reachability analysis and garbage
collection. Just as internet networks are robust against
random removal of nodes [4], it is likely that random removal
of classes will have little effect on the proportion of code
which can be reached and thus executed.

The remainder of this paper is organized as follows.
Section 2 describes related work in discovering power law
distributions and scale-free characteristics of software. In
Section 3 we describe our analysis techniques and present the
results in Section 4. Section 5 gives our conclusions and
ideas for future research.



2 Related Work 



There has been substantial work on power law distributions
in natural phenomena and, in recent years, in the evolution
of the Web. Only recently has attention turned to the power
law distributions found in program code and, in particular,
those relating to Java software.

O'Donoghue et al. have performed a run-time analysis of
Java bytecode sequences obtained using a customized
version of the Kaffe JVM [15]. Their experiments showed that
the frequencies with which consecutive pairs of instructions
are interpreted by the virtual machine follow a power law.

Potanin et al. have conducted experiments using a
query-based analysis tool called Fox, which is an enhanced
version of Bill Foote's Heap Analysis Tool (HAT). Their
research has confirmed power laws in the indegrees and
outdegrees of the run-time object graphs of several
programs [?].

Valverde et al. have shown that the indegrees and outdegrees
of nodes in a network of class diagrams also follow power
laws, leading to a scale-free network topology similar to
that of the Web [20]. Since these diagrams have a one-to-one
mapping with the source code structure, the implication
is that these laws are a feature of object-oriented program
code.



A major feature of the work in this paper is the analysis of
different coupling types. A commonly-held view in software
engineering is that there is a link between complexity in
software and the understandability of that software.
The more coupling in a system, the more complex the system.
Too much coupling may be indicative of a poorly
thought out design or of inadequate standards of
maintenance. There is evidence to suggest that excessive
coupling can lead to more fault-prone software [7] [12]. The
large number of different ways of writing OO code means that
accurately capturing, categorising and analyzing the
different forms of coupling is a difficult task. A
comprehensive framework for measuring coupling in OO
systems is described by Briand et al. [6].

Our research shows that even when the network of classes
is decomposed by coupling type, power laws still prevail.
This has led to the identification of twelve distinct power
law distributions.

3 Analysis Techniques 

As part of this research, a system called AutoCode has been
developed for indexing Java source code. AutoCode works
by using a custom doclet which extends the Javadoc
program and allows easy access to the code structure. We used
the AutoCode system to generate graphs for each of five
coupling types - Inheritance, Interface, Aggregation,
Parameter Type and Return Type. An illustration of how these
graphs can be derived from source code can be seen in
Figure 1. To identify the power laws, we then performed
statistical operations on these five graphs. With the
exception of inheritance, we need to consider the
relationships from two perspectives: for example, from the
perspective of the interface and from that of the
implementing class. We do not need to consider the number of
superclasses because Java classes only ever have one
superclass. The number of methods, constructors and fields
in each class was also studied.
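
The two perspectives correspond to the outdegree and indegree
of each node in a coupling graph. The following Python sketch
shows one way these could be tallied from edge lists. It is
not the AutoCode implementation: the edge lists are
transcribed by hand from the classes in Figure 1, the edge
direction (from the using class to the used class) is our
assumption for this example, and the helper names are
hypothetical.

```python
from collections import Counter

# Hand-transcribed (source, target) edges for the Figure 1 classes.
# Direction is an assumption: the class that depends on another
# points at the class it depends on.
edges = {
    "inheritance": [("String", "CharSequence")],
    "interface": [("StringFileReader", "StringReader")],
    "aggregation": [("StringFileReader", "String")],
    "return_type": [("StringReader", "String"),
                    ("StringFileReader", "String")],
    "parameter_type": [("StringFileReader", "String"),
                       ("String", "String")],
}


def degrees(edge_list):
    """Return (outdegree, indegree) counters keyed by class name."""
    outdegree = Counter(source for source, _ in edge_list)
    indegree = Counter(target for _, target in edge_list)
    return outdegree, indegree


def degree_frequencies(degree_counter):
    """Map each degree value to the number of classes that have it."""
    return Counter(degree_counter.values())


for coupling, edge_list in edges.items():
    outdegree, indegree = degrees(edge_list)
    print(coupling, dict(outdegree), dict(indegree))
```

The output of degree_frequencies is the kind of
value-to-frequency table on which the regression described
later in this section would operate.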

Data was collected from three large Java systems: 

1. The core Java class libraries shipped with the Java
Developers Kit (JDK) provide implementations of common
functions required for many programs. Version
1.4.1 of the JDK contains 1 400 000 lines of code
spread over 6 000 classes.

2. Apache Ant is a Java-based build tool. It behaves in 
a similar way to make but uses XML-based configu- 
ration files, which define various tasks to be executed. 
The source code for version 1.5.3 of Apache Ant con- 
tains 145 000 lines of code spread over 500 classes. 

3. Tomcat is the servlet container used in the official
reference implementation for Java Servlets and
JavaServer Pages. The source code for Jakarta Tomcat
version 4.0 contains 150 000 lines spread over 370
classes.

    interface StringReader {
        String readString();
    }

    abstract class CharSequence {
        int getLength();
        append(String addme);
    }

    class StringFileReader implements StringReader {
        String lastString;

        StringFileReader(String filename) { }
        String readString() { }
    }

    class String extends CharSequence {
        append(String addme);
    }

[Graph panels: Interface, Inheritance, Aggregation, Return Type, Parameter Type]

Figure 1. Illustration of coupling types and their graph representations.

To identify the power laws we perform linear regression on
log-log data plots. The number of occurrences, $y$, of a
value of magnitude $x$ is given by the equation
$y = Cx^{-a}$, which implies that
$\log(y) = \log(C) - a\log(x)$. Hence, the power law can
easily be identified by a straight line with gradient $-a$
on a log-log plot. Because of significant clustering of
data points near the x-axis, regression on these plots leads
to skewed results. To prevent this, the values must be
grouped into buckets of exponentially increasing size [2].
The logarithm of the frequency is plotted against the
logarithm of the mid-point of each bucket. From the
subsequent regression a more accurate exponent value, $a$,
can be obtained than if all the original data points are
considered. It is this value which allows us to predict the
likely features of future systems. A low value of the
exponent signifies a tendency towards a less skewed
distribution.
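
The procedure can be sketched in Python as follows. This is an
illustration under stated assumptions rather than the code used
for our experiments: buckets are taken to double in size
(bucket k covers the values from 2^k up to, but not including,
2^(k+1)), the mid-point is taken as 1.5 * 2^k, the frequency is
the raw count of observations in the bucket, and the function
names are hypothetical.

```python
import math
from collections import Counter


def fit_line(xs, ys):
    """Ordinary least squares; returns (slope, intercept, r_squared)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept, (sxy * sxy) / (sxx * syy)


def bucketed_exponent(values):
    """Estimate a and C in y = C * x**(-a) from positive integers.

    Assumed bucketing: bucket k holds values v with 2**k <= v < 2**(k+1),
    computed exactly via int.bit_length. Needs at least two non-empty
    buckets with differing counts.
    """
    counts = Counter(v.bit_length() - 1 for v in values if v >= 1)
    buckets = sorted(counts)
    log_mid = [math.log(1.5 * 2 ** k) for k in buckets]
    log_freq = [math.log(counts[k]) for k in buckets]
    slope, intercept, r_squared = fit_line(log_mid, log_freq)
    return -slope, math.exp(intercept), r_squared


# Toy data whose bucket counts halve as the bucket mid-point doubles,
# so the fitted exponent should be close to 1.
toy = [1] * 1024 + [2] * 512 + [4] * 256 + [8] * 128
a, c, r_squared = bucketed_exponent(toy)
print(a, c, r_squared)
```

Note that the fitted exponent depends on the choice made for
the frequency: dividing each bucket count by the width of the
bucket before taking logarithms would, for data drawn from a
pure power law, increase the fitted exponent by approximately
one.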

4 Results 

4.1 Methods, Fields and Constructors 

The majority of this study concerns coupling relationships
between classes. However, three power laws were identified
without type information. These relate to the fundamental
building blocks of classes - the number of fields in each
class, the number of methods in each class and the number
of class constructors. Figure 2 shows log-log plots
highlighting each of these relationships.

For the distribution of the number of methods, the
exponents are 1.168, 1.104 and 0.734 for JDK, Ant and
Tomcat, respectively. This implies that in the JDK there is
a higher proportion of classes with very few methods when
compared with the other two systems. This might imply
fewer key classes in this system. For the distribution of
the number of fields, the exponents are 1.108, 0.988 and
0.998 for JDK, Ant and Tomcat, respectively. The difference
in the magnitude of the exponents would indicate no
strong relationship between the number of methods and the
number of fields. One might imagine that a large number of
fields implies a larger number of methods to operate
on those fields. Based upon our observation, however, we
hypothesize that it is infeasible to predict the number of
methods from the number of fields and vice versa. This
hypothesis is supported by correlations between the number
of methods, fields and constructors (Figure 3). The
correlation matrix in Figure 4 shows that no strong
correlation exists between any of these measures.

For the distribution of the number of constructors, the
exponents are 3.560, 3.589 and 3.112 for JDK, Ant and Tomcat,
respectively. This implies that classes with a large number
of constructors are rarely found in systems of this scale. For
example, the JDK system contains only three classes with
more than ten constructors. Previous work into refactoring
of constructors found similar evidence for five
medium-sized Java systems [14]. Only one class was found to
have ten constructors. This class was part of the Swing
library.





              Methods   Fields   Constructors
Methods       1
Fields        0.216     1
Constructors  0.215     0.0827   1

Figure 4. Correlation matrix for class members in the JDK
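
A matrix of this form can be computed as in the following
Python sketch. The per-class counts below are invented purely
for illustration and are not taken from the JDK, and the sketch
assumes that the correlation is computed on raw counts; the
helper names are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Return a nested dict with the correlation of every pair of columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names}
            for a in names}


# Made-up counts for six classes (illustration only).
counts = {
    "methods": [3, 10, 1, 25, 7, 2],
    "fields": [1, 4, 0, 2, 9, 1],
    "constructors": [1, 2, 1, 1, 3, 1],
}
matrix = correlation_matrix(counts)
for name, row in matrix.items():
    print(name, [round(value, 3) for value in row.values()])
```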



4.2 Coupling Power-Laws 

The frequency with which classes are used as superclasses
to other classes can be calculated by examining the
distribution of outlinks in the superclass-subclass graph.
Figure 5 shows a bucketed log-log plot of the number of
descendants of the classes in the JDK. The results show that
the distribution follows a power law with exponent 0.906.
The exponents for Apache Ant and Jakarta Tomcat are 0.810 and
1.310, respectively. The high value for Tomcat implies that
more classes in that system have relatively few descendants,
whilst a small number of classes are extended by many
descendants. In other words, the functionality of the system
is distributed more evenly than in the other two systems. In
contrast, for the Ant system, much of the functionality is
contained in subclasses of key classes such as Task and
BaseParamFilterReader. Hence the functionality is
more concentrated in fewer classes in this system.

By using the same techniques we can show that the distri- 
bution of the number of classes implementing an interface 
follows a power law, with exponents 1.130, 1.118 and 1.636 
for JDK, Ant and Tomcat, respectively. This makes sense 
if we consider the use of interfaces as a surrogate for multi- 
ple inheritance. We would expect a similar distribution for 
interface implementations as for subclasses. 

The distribution in the number of interfaces implemented
by a class also follows a power law, with a much higher
exponent of 3.663, as can be seen from Figure 6. This
exponent was calculated for the JDK. Insufficient data was
available to calculate the exponents for the other two
systems. This result can be explained by virtue of very few
classes implementing a large number of interfaces. Those
that do implement a large number of interfaces tend to
delegate the responsibility for the methods of these
interfaces to members of the same interface.

Figure 2. Log-log plots showing power law distributions in the number of (a) fields, (b) methods and
(c) constructors of classes in the JDK class libraries.

Figure 3. Log-log plots showing the relationships between (a) the number of fields and the number
of constructors, (b) the number of methods and the number of constructors and (c) the number of
methods and the number of fields for classes in the JDK.

Figure 5. Log-Log plot showing a power law
distribution in the number of subclasses of
each class in the JDK class library.

Two further power law distributions can be seen in the
relationships between classes as member variables. The first
is a power law distribution in the number of other classes
referenced as member variables within a given class. For
example, in Figure 1, StringFileReader references one
class, String, via the field lastString. The exponents
of the distributions are 1.051, 1.386 and 1.493 for JDK, Ant
and Tomcat, respectively. The low value for JDK reflects a
comparatively uniform distribution of coupling via
aggregation in this system. One explanation for the low JDK
value may be that the roles of various packages in the
system do not overlap and hence there are multiple focal
points for aggregation, as opposed to a centralized
structure.

The second distribution is in the number of classes which
reference a given class as a member variable. For example,
in Figure 1, String is referenced by one class,
StringFileReader. The exponents of these distributions are
1.399, 2.295 and 1.991 for JDK, Ant and Tomcat,
respectively. Interestingly, the JDK again has the lowest
exponent value, supporting the previous hypothesis about
multiple focal points for aggregation.

Both of these power laws can be seen from the plots in
Figure 7. It is noticeable that the values for the first
distribution are lower than the corresponding values for the
second. This can be explained by the tendency in
object-oriented code for many classes to be grouped together
as members of another class. In contrast, it is
comparatively rare for a class to be referenced as a member
in many classes.

Figure 6. Log-Log plot showing a power law
distribution in the number of interfaces
implemented by classes in the JDK class library.

Four more class features were analyzed for power law
distributions, namely the indegrees and outdegrees induced by
parameter types and return types for each of the three
systems. All showed scale-free topology. The Ant system has
comparatively high values for all the exponents in these
relationships. Inspection of the classes in this system and
subsequent analysis revealed no strong correlation between
usage of return types and parameters. This could be
considered a surprising result, since we might expect
parameters and return types to be linked. No obvious
explanation could be found for the differences in exponents
between the systems.

The exponent values for all three systems can be found in
Figures 8, 9 and 10. The $r^2$ values denote the squared
Pearson product-moment correlation. The high $r^2$ values
for JDK reflect the larger number of classes in this system.
As a result, we would expect more consistency in the data.
The $r^2$ values are relatively low but still support the
theory.

5 Conclusions and Future Work 

In this paper we have illustrated that power law
distributions exist in object-oriented class relationships,
in particular those related to coupling. Twelve new power
laws have been identified. The exponents of these power laws
are given for the JDK (Figure 8), Tomcat (Figure 9) and
Ant (Figure 10). One conclusion from this work is the
belief that these regularities are common across all
non-trivial object-oriented programs.

Figure 7. Log-Log plots showing power law distributions in (a) the number of classes referenced as
field variables and (b) in the number of classes which contain references to classes as field variables.

Another conclusion is that the different types of coupling 
examined are independent. This finding contradicts the hy- 
pothesis that high usage in one form of coupling can be used 
to predict high usage in another form. 

The implications of these findings are that we can use the 
data to predict the dimensions of future systems. This will 
allow us to estimate the complexity of developing and main- 
taining those systems. 

It is interesting to note that the exponents for Ant and Tom- 
cat rarely fall within the 95% confidence intervals of the 
JDK. We believe that these exponents are due to deeper 
properties of the collections. The conclusion is that whilst 
there are common properties between these systems, each 
individual system has its own unique characteristics. 

Bieman and Murdock have already shown that there is a
large body of freely accessible source code available on the
Web [?]. In terms of future work, it would be interesting
to verify these results using a large crawl of such data.
Assuming that these results hold, a number of techniques can
be brought to bear to explain the phenomena.

In order to explain the power law in World Wide Web
graphs, new models for its growth and evolution have
emerged. The key to these models is a process known as
preferential attachment, in which pages which have a
high indegree are more likely to be referred to by new links.
This can be explained by considering a page with higher
indegree as being more popular, more important and better
connected. It is thus more likely to be visited by a user
who may then also choose to link to that page. Research is
ongoing to find methods to improve the model - for example,
by combining preferential and non-preferential
attachment [13]. Other future work will investigate the
accuracy with which these models can predict the structure
of program code.
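
To make the attachment process concrete, the following Python
sketch grows a graph one node at a time, with each new node
adding a single link. It is a generic illustration and not the
specific model of [13]: the mixing probability p_preferential,
the choice of weighting existing nodes by indegree plus one,
and the function name are all assumptions made for the example.

```python
import random
from collections import Counter


def grow_graph(n_nodes, p_preferential=0.8, seed=0):
    """Return the indegree of every node after growing the graph.

    With probability p_preferential the new node links to an existing
    node chosen in proportion to (indegree + 1); otherwise it links to
    an existing node chosen uniformly at random.
    """
    rng = random.Random(seed)
    indegree = [0]
    # Each node v appears indegree[v] + 1 times in the urn, so a uniform
    # draw from the urn is a draw proportional to (indegree + 1).
    urn = [0]
    for new_node in range(1, n_nodes):
        if rng.random() < p_preferential:
            target = rng.choice(urn)
        else:
            target = rng.randrange(new_node)
        indegree[target] += 1
        urn.append(target)
        indegree.append(0)
        urn.append(new_node)
    return indegree


indegrees = grow_graph(10000)
frequency = Counter(indegrees)
print(sorted(frequency.items())[:5], max(indegrees))
```

The resulting indegree frequencies could then be bucketed and
fitted in the same way as the class data, to see how closely
such a model reproduces the exponents observed in code.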
Figure 8. 95% confidence intervals for power law exponents in JDK.